Our data set is a sample of accidents and related accident data compiled by the National Highway Traffic Safety Administration (NHTSA) for 2018. Specifically, it comes from the Crash Report Sampling System (CRSS), a sample of police-reported accident data covering many aspects of an accident, including pedestrians, motor vehicles, property, etc. Every year there are more than 6 million reported accidents, and this data set compiles data on accidents that are “of greatest concern to the highway safety community and the general public”.
Our objective is to gain deeper insight into different aspects of accident trends. We focused on a few areas, including how alcohol use relates to other variables, the area of the US, and others.
The following graphs illustrate accident trends by hour of the day and by day of the week. Most accidents happen during typical commuting times, with the fewest during the overnight hours (between midnight and 5 a.m.). The distribution of accidents across the days of the week is relatively uniform, with the highest count on Friday and lower counts on the weekend. Presumably, the higher number of accidents during the week is due to the additional traffic from commuting.
Alcohol was involved in only a small percentage of all accidents. However, looking more closely at just the accidents involving alcohol, certain trends become clearer.
It is hard to see the significance of alcohol when comparing against the overall number of accidents in our data. However, some trends do emerge: the proportion of accidents involving alcohol increases on Friday and the weekend days, as well as in the evening and overnight hours.
The Central Limit Theorem (CLT) states that the distribution of the means of samples of a given size drawn from a population will, in most cases, be approximately normal, and that the approximation improves as the sample size increases. We tried this out with the age of the driver from the PERSON dataset, and the theorem held: as we increased the sample size, the distribution of sample means became increasingly normal. In the output below, the standard deviation of the sample means also shrinks as the sample size grows, consistent with the population SD divided by the square root of the sample size.
## [1] "Population mean: 37.3305331448058 Population SD: 19.0429683609092"
## [1] "Population mean: 37.33 Population SD: 19.04"
## [1] "Sample size: 10 Sample Mean: 37.28 Sample SD 6.03"
## [2] "Sample size: 20 Sample Mean: 37.26 Sample SD 4.25"
## [3] "Sample size: 30 Sample Mean: 37.36 Sample SD 3.45"
## [4] "Sample size: 40 Sample Mean: 37.37 Sample SD 3.05"
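The behaviour in the output above can be reproduced with a small simulation. The following is a minimal Python sketch, not the code used to produce the report: the population is a synthetic, right-skewed stand-in for driver ages (not the PERSON data), and the function name `sample_mean_summary`, the number of repeated samples, and the seeds are all assumptions made for illustration. It compares the SD of the sample means with the population SD divided by the square root of the sample size; for the reported figures, 19.04 / √10 ≈ 6.02 against the 6.03 shown for a sample size of 10.

```python
import math
import random
import statistics


def sample_mean_summary(population, sample_size, n_samples=2000, seed=0):
    """Return (mean, sd) of the means of repeated random samples."""
    rng = random.Random(seed)
    means = [
        # Each sample is drawn with replacement from the population.
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]
    return statistics.fmean(means), statistics.stdev(means)


# Synthetic, right-skewed stand-in for driver ages (NOT the CRSS PERSON data).
pop_rng = random.Random(42)
population = [16 + pop_rng.gammavariate(2.0, 11.0) for _ in range(50_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

results = {}
for n in (10, 20, 30, 40):
    mean_of_means, sd_of_means = sample_mean_summary(population, n)
    results[n] = (mean_of_means, sd_of_means)
    expected_sd = pop_sd / math.sqrt(n)  # spread predicted by the CLT
    print(
        f"Sample size: {n} Sample Mean: {mean_of_means:.2f} "
        f"Sample SD: {sd_of_means:.2f} Expected SD: {expected_sd:.2f}"
    )
```

Even though the synthetic population is skewed, the mean of the sample means should stay close to the population mean while their SD falls roughly in proportion to 1/√n.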
In this pie chart we can see how the number of injuries differs across severity levels. The largest category is possible injury, where the injury may range from none to less than minor. The second largest is minor injury, such as bruises and scrapes. The third is serious injury, which can be life-threatening and require hospital treatment. The least common category is fatal injury, where the victim died as a result of the accident.
The urbanicity variable takes two values, “Urban” and “Rural”. The total accident counts show that urban areas have more accidents than rural areas. This is expected, because urban areas have more traffic than rural areas, especially during rush hours. In the data, the number of rural accidents is only around 30% of the number of urban accidents.
Comparing severity across urbanicity, every severity level is more common in urban areas, while rural areas have fewer accidents overall. For each severity level, the rural count is only around 10% to 25% of the urban count. The urban counts by severity follow the same ordering as the totals: no injury is the highest, followed by possible injury, then minor injury, serious injury, and fatal injury.
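The rural-versus-urban comparison amounts to a cross-tabulation of severity by urbanicity, followed by a ratio per severity level. A minimal Python sketch of that calculation is shown here; the function name `rural_share_of_urban` and the record layout are hypothetical, and the toy counts are made up purely to show the arithmetic, so they are not the CRSS figures.

```python
from collections import Counter


def rural_share_of_urban(records):
    """Return {severity: rural count as a percentage of the urban count}."""
    counts = Counter(records)  # keys are (severity, urbanicity) pairs
    severities = sorted({severity for severity, _ in counts})
    return {
        severity: 100.0 * counts[(severity, "Rural")] / counts[(severity, "Urban")]
        for severity in severities
        if counts[(severity, "Urban")] > 0  # skip levels with no urban cases
    }


# Toy records with made-up counts, NOT the CRSS figures.
toy_records = (
    [("No injury", "Urban")] * 80
    + [("No injury", "Rural")] * 20
    + [("Fatal", "Urban")] * 10
    + [("Fatal", "Rural")] * 2
)

shares = rural_share_of_urban(toy_records)
for severity, share in shares.items():
    print(f"{severity}: rural is {share:.1f}% of urban")
```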
We use three different sampling methods on the injury severity data. The first is simple random sampling without replacement, where each case has the same chance of being selected as any other. The second is systematic sampling, where the sampling interval is the total number of cases divided by the chosen sample size (1000 in this case). The third is stratified sampling, where the cases are divided into strata by severity and a sample is drawn from each stratum so that the total matches the desired sample size (also 1000).
## Stratum 1
##
## Population total and number of selected units: 24354 18.39134
## Stratum 2
##
## Population total and number of selected units: 10564 144.5395
## Stratum 3
##
## Population total and number of selected units: 6861 513.0614
## Stratum 4
##
## Population total and number of selected units: 873 222.5499
## Stratum 5
##
## Population total and number of selected units: 4816 101.4578
## Number of strata 5
## Total number of selected units 1000
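The three sampling methods can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code that produced the output above: the stratum sizes are copied from that output, the stratum ids 1 to 5 stand in for the severity levels (their mapping to severity names is not given), the records are hypothetical `(stratum, index)` pairs, and proportional allocation with rounding is assumed for the stratified sample. Note that the proportional shares of 1000 for these totals (513.06, 222.55, 144.54, 18.39 and 101.46 for strata 1 to 5) are the same values printed in the output, but listed there in a different order than the strata, which may indicate that the allocation vector and the strata order were not aligned in the original run.

```python
import random
from collections import Counter

# Stratum sizes copied from the output above; ids 1-5 stand in for severity levels.
STRATUM_SIZES = {1: 24354, 2: 10564, 3: 6861, 4: 873, 5: 4816}
SAMPLE_SIZE = 1000


def simple_random_sample(units, n, rng):
    """Simple random sampling without replacement."""
    return rng.sample(units, n)


def systematic_sample(units, n, rng):
    """Take every k-th unit after a random start, with k = N // n."""
    k = len(units) // n
    start = rng.randrange(k)
    return [units[start + i * k] for i in range(n)]


def proportional_allocation(sizes, n):
    """Allocate n units to strata in proportion to stratum size (rounded)."""
    total = sum(sizes.values())
    # Rounding can leave the sum slightly off n in general; it is exact here.
    return {stratum: round(n * size / total) for stratum, size in sizes.items()}


def stratified_sample(units, sizes, n, rng):
    """Draw a simple random sample within each stratum."""
    sample = []
    for stratum, n_h in proportional_allocation(sizes, n).items():
        members = [unit for unit in units if unit[0] == stratum]
        sample.extend(rng.sample(members, n_h))
    return sample


rng = random.Random(1)
# Hypothetical records: (stratum id, case index), shuffled so order is unrelated to stratum.
population = [(s, i) for s, size in STRATUM_SIZES.items() for i in range(size)]
rng.shuffle(population)

srs = simple_random_sample(population, SAMPLE_SIZE, rng)
systematic = systematic_sample(population, SAMPLE_SIZE, rng)
stratified = stratified_sample(population, STRATUM_SIZES, SAMPLE_SIZE, rng)

allocation = proportional_allocation(STRATUM_SIZES, SAMPLE_SIZE)
stratified_counts = Counter(stratum for stratum, _ in stratified)
print("Allocation:", allocation)
print("Stratified counts:", dict(stratified_counts))
```

For these totals, rounding gives 513, 223, 145, 18 and 101 units for strata 1 to 5, which sum to 1000. Simple random and systematic sampling only approximate these proportions, whereas the stratified sample matches the allocation exactly.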